Abstract 

This paper presents an architecture for the 
generation of spoken monologues with con- 
textually appropriate intonation. A two- 
tiered information structure representation 
is used in the high-level content planning 
and sentence planning stages of generation 
to produce efficient, coherent speech that 
makes certain discourse relationships, such 
as explicit contrasts, appropriately salient. 
The system is able to produce appropriate 
intonational patterns that cannot be gen- 
erated by other systems which rely solely 
on word class and given/new distinctions. 



1 Introduction 

While research on generating coherent written text 
has flourished within the computational linguistics 
and artificial intelligence communities, research on 
the generation of spoken language, and particularly 
intonation, has received somewhat less attention. 
In this paper, we argue that commonly employed 
models of text organization, such as schemata and 
rhetorical structure theory (RST), do not adequately 
address many of the issues involved in generating 
spoken language. Such approaches fail to consider 
contextually bound focal distinctions that are mani- 
fest through a variety of different linguistic and par- 
alinguistic devices, depending on the language. 

In order to account for such distinctions of fo- 
cus, we employ a two-tiered information structure 
representation as a framework for maintaining lo- 
cal coherence in the generation of natural language. 
The higher tier, which delineates the theme, that 
which links the utterance to prior utterances, and 
the rheme, that which forms the core contribution of 
the utterance to the discourse, is instrumental in de- 
termining the high-level organization of information 
within a discourse segment. Dividing semantic rep- 
resentations into their thematic and rhematic parts 
allows propositions to be presented in a way that 
maximizes the shared material between utterances. 

The lower tier in the information structure repre- 
sentation specifies the semantic material that is in 
"focus" within themes and rhemes. Material may be 
in focus for a variety of reasons, such as to empha- 
size its "new" status in the discourse, or to contrast 
it with other salient material. Such focal distinc- 
tions may affect the linguistic presentation of infor- 
mation. For example, the it-cleft in (1) may mark 
John as standing in contrast to some other recently 
mentioned person. Similarly, in (2), the pitch accent 
on red may mark the referenced car as standing in 
contrast to some other car inferable from the dis- 
course context.1 

(1) It was John who spoke first. 

(2) Q: Which car did Mary drive? 

    A: (MARY drove)th (the RED car.)rh 

        L+H*  LH(%)        H*   LL$ 

By appealing to the notion that the simple rise-fall 
tune (H* LL%) very often accompanies the rhematic 
material in an utterance and the rise-fall-rise tune of- 
ten accompanies the thematic material (Steedman, 
1991; Prevost and Steedman, 1994), we present a 
spoken language generation architecture for produc- 
ing short spoken monologues with contextually ap- 
propriate intonation. 

1 In this example, and throughout the remainder of 
the paper, the intonation contour is informally noted 
by placing prosodic phrases in parentheses and marking 
pitch accented words with capital letters. The tunes are 
more formally annotated with a variant of (Pierrehum- 
bert, 1980) notation described in (Prevost, 1995). Three 
different pause lengths are associated with boundaries 
in the modified notation. '(%)' marks intra-utterance 
boundaries with very little pausing, '%' marks intra- 
utterance boundaries associated with clauses demar- 
cated by commas, and '$' marks utterance-final bound- 
aries. For the purposes of generation and synthesis, these 
distinctions are crucial. 

(3) Q: I know the AMERICAN amplifier produces MUDDY treble, 
       (But WHAT) (does the BRITISH amplifier produce?) 
            L+H* L(H%)      H*                   LL$ 

(4) Q: I know the AMERICAN amplifier produces MUDDY treble, 
       (But WHAT) (produces CLEAN treble?) 
            L+H* L(H%)      H*       LL$ 

    A: (The BRITISH amplifier) (produces CLEAN 
            H*           L(L%) 
            rheme-focus 
            Rheme 

2 Information Structure 



Information Structure refers to the organization of 
information within an utterance. In particular, it 
defines how the information conveyed by a sentence 
is related to the knowledge of the interlocutors and 
the structure of their discourse. Sentences conveying 
the same propositional content in different contexts 
need not share the same information structure. That 
is, information structure refers to how the semantic 
content of an utterance is packaged, and amounts 
to instructions for updating the models of the dis- 
course participants. The realization of information 
structure in a sentence, however, differs from lan- 
guage to language. In English, for example, intona- 
tion carries much of the burden of information struc- 
ture, while languages with freer word order, such as 
Catalan (Engdahl and Vallduví, 1994) and Turkish 
(Hoffman, 1995), convey information structure syn- 
tactically. 



2.1 Information Structure and Intonation 



The relationship between intonational structure and 
information structure is illustrated by (3) and (4). In 
each of these examples, the answer contains the same 
string of words but different intonational patterns and 
information structural representations. The theme 
of each utterance is considered to be represented by 
the material repeated from the question. That is, 
the theme of the answer is what links it to the ques- 
tion and defines what the utterance is about. The 
rheme of each utterance is considered to be repre- 
sented by the material that is new or forms the core 
contribution of the utterance to the discourse. By 
mapping the rise-fall tune (H* LL%) onto rhemes 
and the rise-fall-rise tune (L+H* LH%) onto themes 
(Steedman, 1991; Prevost and Steedman, 1994), we 
can easily identify the string of words over which 
these two prominent tunes occur directly from the 
information structure. While this mapping is cer- 
tainly overly simplistic, the results presented in Sec- 
tion 4.3 demonstrate its appropriateness for the class 
of simple declarative sentences under investigation. 

Knowing the strings of words to which these two 
tunes are to be assigned, however, does not pro- 
vide enough information to determine the location of 
the pitch accents (H* and L+H*) within the tunes. 
Moreover, the simple mapping described above does 
not account for the frequently occurring cases in 
which thematic material bears no pitch accents and 
is consequently unmarked intonationally. Previous 
approaches to the problem of determining where 
to place accents have utilized heuristics based on 
"givenness." That is, content-bearing words (e.g. 
nouns and verbs) which had not been previously 
mentioned (or whose roots had not been previ- 
ously mentioned) were assigned accents, while func- 
tion words were de-accented (Davis and Hirschberg, 
1988; Hirschberg, 1990). While these heuristics ac- 
count for a broad range of intonational possibilities, 
they fail to account for accentual patterns that serve 
to contrast entities or propositions that were previ- 
ously "given" in the discourse. Consider, for ex- 
ample, the intonational pattern in (5), in which the 
pitch accent on amplifier in the response cannot be 
attributed to its being "new" to the discourse. 



(5) Q: Do critics prefer the BRITISH amplifier 

L* H 
or the AMERICAN amplifier? 

H* LL$ 
A: They prefer the AMERICAN amplifier. 

H* LL$ 

For the determination of pitch accent placement, 
we rely on a secondary tier of information structure 
which identifies focused properties within themes 
and rhemes. The theme-foci and the rheme-foci 
mark the information that differentiates properties 
or entities in the current utterance from properties 
or entities established in prior utterances. Conse- 
quently, the semantic material bearing "new" infor- 
mation is considered to be in focus. Furthermore, 
the focus may include semantic material that serves 
to contrast an entity or proposition from alterna- 
tive entities or propositions already established in 
the discourse. While the types of pitch accents (H* 
or L+H*) are determined by the theme/rheme delin- 
eation and the aforementioned mapping onto tunes, 
the locations of pitch accents are determined by the 
assignment of foci within the theme and rheme, as 
illustrated in (3) and (4). Note that it is in pre- 
cisely those cases where thematic material, which is 
"given" by default, does not contrast with any other 
previously established properties or entities that this 
material is intonationally unmarked, as in (6). 

(6) Q: Which amplifier does Scott PREFER? 

H* LL$ 

A: (He prefers)th (the BRITISH amplifier.)rh 

H* LL$ 
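
The two-tier mapping described in this section can be illustrated with a minimal Python sketch. It is not the paper's Prolog implementation: the function name annotate, the dictionaries, and the simplifying assumption that a theme always ends in LH(%) and a rheme is always utterance-final (LL$) are ours, chosen only to show how the theme/rheme division selects the accent type while the foci select the accent locations.

```python
# Illustrative sketch only: accent type comes from the theme/rheme tier,
# accent location comes from the focus tier.
ACCENT = {"theme": "L+H*", "rheme": "H*"}
# Assumed boundaries: intra-utterance theme, utterance-final rheme.
BOUNDARY = {"theme": "LH(%)", "rheme": "LL$"}


def annotate(words, role, foci):
    """Pair each word with a pitch accent (or None) and return the boundary tone.

    words -- list of words in the prosodic phrase
    role  -- "theme" or "rheme"
    foci  -- set of words in focus within this phrase
    """
    if role == "theme" and not foci:
        # A theme with no contrastive or new material is left
        # intonationally unmarked, as in example (6).
        return [(w, None) for w in words], None
    marked = [(w, ACCENT[role] if w in foci else None) for w in words]
    return marked, BOUNDARY[role]
```

Applied to the answer in (6), annotate(["He", "prefers"], "theme", set()) leaves the theme unmarked, while annotate(["the", "British", "amplifier"], "rheme", {"British"}) places H* on "British" and closes the phrase with LL$.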



2.2 Contrastive Focus Algorithm 

The determination of contrastive focus, and con- 
sequently the determination of pitch accent loca- 
tions, is based on the premise that each object in 
the knowledge base is associated with a set of alter- 
natives from which it must be distinguished if refer- 
ence is to succeed. The set of alternatives is deter- 
mined by the hierarchical structure of the knowledge 
base. For the present implementation, only proper- 
ties with the same parent or grandparent class are 
considered to be alternatives to one another. 
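
One possible reading of the parent-or-grandparent criterion is sketched below in Python. The class hierarchy, the entity names and the helper functions are hypothetical and serve only as an illustration; the sketch treats two objects as alternatives when their parent classes match or their grandparent classes match.

```python
# Hypothetical isa hierarchy (child -> parent); not taken from the paper's KB.
PARENT = {
    "e1": "solid_state_amp",
    "e2": "tube_amp",
    "e3": "cd_player",
    "solid_state_amp": "amplifier",
    "tube_amp": "amplifier",
    "cd_player": "component",
    "amplifier": "component",
}


def ancestors(node, depth=2):
    """Return [parent, grandparent] of node (fewer if the hierarchy ends)."""
    found = []
    while depth > 0 and node in PARENT:
        node = PARENT[node]
        found.append(node)
        depth -= 1
    return found


def alternatives(x, universe):
    """Objects whose parent equals x's parent or whose grandparent equals x's."""
    mine = ancestors(x)
    result = set()
    for y in universe:
        if y == x:
            continue
        # zip compares parent with parent and grandparent with grandparent.
        if any(a == b for a, b in zip(mine, ancestors(y))):
            result.add(y)
    return result
```

With this toy hierarchy, alternatives("e1", {"e1", "e2", "e3"}) contains only e2, since e1 and e2 share the grandparent class amplifier while e3 does not.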

Given an entity x and a referring expression for x, 
the contrastive focus feature for its semantic repre- 
sentation is computed on the basis of the contrastive 
focus algorithm described in (7), (8) and (9). The 
data structures and notational conventions are given 
below. 



(7) DElist: a collection of discourse entities 
that have been evoked in prior dis- 
course, ordered by recency. The list 
may be limited to some size k so that 
only the k most recent discourse enti- 
ties pushed onto the list are retrievable. 

ASet(x): the set of alternatives for object 
x, i.e. those objects that belong to 
the same class defined in the 
knowledge base. 

RSet(x, S): the set of alternatives for ob- 
ject x as restricted by the referring ex- 
pressions in DElist and the set of prop- 
erties S. 

CSet(x, S): the subset of properties of S to 
be accented for contrastive purposes. 

Props(x): a list of properties for object x, 
ordered by the grammar so that nomi- 
nal properties take precedence over ad- 
jectival properties. 



The algorithm, which assigns contrastive focus 
in both thematic and rhematic constituents, begins 
by isolating the discourse entities in the given con- 
stituent. For each such entity x, the structures de- 
fined above are initialized as follows: 



(8) Props(x) := [P | P(x) is true in KB] 

    ASet(x) := {y | alt(x, y)}, x's alternatives 

    RSet(x, {}) := {x} ∪ {y | y ∈ ASet(x) & y ∈ 
    DElist}, evoked alternatives 

    CSet(x, {}) := {} 



The algorithm appears in pseudo-code in (9).2 



2 An in-depth discussion of the algorithm and numer- 
ous examples are presented in (Prevost, 1995). 



(9) S := {} 

    for each P in Props(x) 
        RSet(x, S ∪ {P}) := 
            {y | y ∈ RSet(x, S) & P(y)} 
        if RSet(x, S ∪ {P}) = RSet(x, S) then 
            % no restrictions were made 
            % based on property P. 
            CSet(x, S ∪ {P}) := CSet(x, S) 
        else 
            % property P eliminated some 
            % members of the RSet. 
            CSet(x, S ∪ {P}) := CSet(x, S) ∪ {P} 
        endif 
        S := S ∪ {P} 
    endfor 
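
The pseudo-code in (9) can be rendered as a short runnable sketch. The Python below is our own illustration rather than the Prolog implementation; the function name contrastive_focus, the holds callback and the toy knowledge base in the usage note are assumptions introduced for the example.

```python
def contrastive_focus(x, props, aset, delist, holds):
    """Return the properties of x to accent for contrast (the CSet).

    x      -- the entity being referred to
    props  -- Props(x): ordered property names, nominal before adjectival
    aset   -- ASet(x): alternatives of x in the knowledge base
    delist -- discourse entities evoked in prior discourse
    holds  -- function holds(p, y) -> True if property p is true of y
    """
    # Initial RSet: x plus those alternatives already evoked in the discourse.
    rset = {x} | {y for y in aset if y in delist}
    cset = []
    for p in props:
        restricted = {y for y in rset if holds(p, y)}
        if restricted != rset:
            # Property p eliminated some member of the RSet, so it
            # distinguishes x from an evoked alternative.
            cset.append(p)
        rset = restricted
    return cset
```

For instance, with a toy knowledge base kb = {"e1": {"amplifier", "solid-state"}, "e2": {"amplifier", "tube"}} and holds = lambda p, y: p in kb[y], the call contrastive_focus("e2", ["amplifier", "tube"], {"e1"}, ["e1"], holds) selects only "tube": the property "amplifier" is shared by both entities and so restricts nothing. If e1 has not been evoked, no property is selected.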



In other words, given an object x, a list of its prop- 
erties and a set of alternatives, the set of alternatives 
is restricted by including in the initial RSet only x 
and those objects that are explicitly referenced in the 
prior discourse. Initially, the set of properties to be 
contrasted (CSet) is empty. Then, for each property 
of x in turn, the RSet is restricted to include only 
those objects satisfying the given property in the 
knowledge base. If imposing this restriction on the 
RSet for a given property decreases the cardinality 
of the RSet, then the property serves to distinguish 
x from other salient alternatives evoked in the prior 
discourse, and is therefore added to the contrast set. 
Conversely, if imposing the restriction on the RSet 
for a given property does not change the RSet, the 
property is not necessary for distinguishing x from 
its alternatives, and is not added to the CSet. 

Based on this contrastive focus algorithm and the 
mapping between information structure and into- 
nation described above, we can view information 
structure as the representational bridge between dis- 
course and intonational variability. The following 
sections elucidate how such a formalism can be in- 
tegrated into the computational task of generating 
spoken language. 

3 Generation Architecture 

The task of natural language generation (NLG) has 
often been divided into three stages: content plan- 
ning, in which high-level goals are satisfied and dis- 
course structure is determined, sentence planning, in 
which high-level abstract semantic representations 
are mapped onto representations that more fully 
constrain the possible sentential realizations (Ram- 
bow and Korelsky, 1992; Reiter and Mellish, 1992; 
Meteer, 1991), and surface generation, in which the 
high-level propositions are converted into sentences. 

The selection and organization of propositions 
and their divisions into theme and rheme are de- 
termined by the content planner, which maintains 
discourse coherence by stipulating that semantic in- 
formation must be shared between consecutive utter- 
ances whenever possible. That is, the content plan- 
ner ensures that the theme of an utterance links it 
to material in prior utterances. 

The process of determining foci within themes and 
rhemes can be divided into two tasks: determining 
which discourse entities or propositions are in fo- 
cus, and determining how their linguistic realizations 
should be marked to convey that focus. The first 
of these tasks can be handled in the content phase 
of the NLG model described above. The second of 
these tasks, however, relies on information, such as 
the construction of referring expressions, that is of- 
ten considered the domain of the sentence planning 
stage. For example, although two discourse entities 
e1 and e2 can be determined to stand in contrast 
to one another by appealing only to the discourse 
model and the salient pool of knowledge, the method 
of contrastively distinguishing between them by the 
placement of pitch accents cannot be resolved until 
the choice of referring expressions has been made. 
Since referring expressions are generally taken to be 
in the domain of the sentence planner (Dale and 
Haddock, 1991), the present approach resolves is- 
sues of contrastive focus assignment at the sentence 
processing stage as well. 

During the content generation phase, the content 
of the utterance is planned based on the previous 
discourse. While schema-based systems (McKeown, 
1985) have been widely used, rhetorical structure 
theory (RST) approaches (Mann and Thompson, 
1986; Hovy, 1993), which organize texts by identify- 
ing rhetorical relations between clause-level propo- 
sitions from a knowledge base, have recently flour- 
ished. Sibun (Sibun, 1991) offers yet another al- 
ternative in which propositions are linked to one 
another not by rhetorical relations or pre-planned 
schemata, but rather by physical and spatial prop- 
erties represented in the knowledge-base. 

The present framework for organizing the con- 
tent of a monologue is a hybrid of the schema and 
RST approaches. The implementation, which is pre- 
sented in the following section, produces descrip- 
tions of objects from a knowledge base with context- 
appropriate intonation that makes proper distinc- 
tions of contrast between alternative, salient dis- 
course entities. Certain constraints, such as the re- 
quirement that objects be identified or defined at 
the beginning of a description, are reminiscent of 
McKeown's schemata. Rather than imposing strict 
rules on the order in which information is presented, 
the order is determined by domain specific knowl- 
edge, the communicative intentions of the speaker, 
and beliefs about the hearer's knowledge. Finally, 
the system includes a set of rhetorical constraints 
that may rearrange the order of presentation for in- 
formation in order to make certain rhetorical rela- 
tionships salient. While this approach has proven 
effective in the present implementation, further re- 
search is required to determine its usefulness for a 
broader range of discourse types. 

4 The Prolog Implementation 

The monologue generation program produces text 
and contextually-appropriate intonation contours to 
describe an object from the knowledge base. The 
system exhibits the ability to intonationally contrast 
alternative entities and properties that have been 
explicitly evoked in the discourse even when they 
occur with several intervening sentences. 

4.1 Content Generation 

The architecture for the monologue generation pro- 
gram is shown in Figure 1, in which arrows repre- 
sent the computational flow and lines represent de- 
pendencies among modules. The remainder of this 
section contains a description of the computational 
path through the system with respect to a single 
example. The input to the program is a goal to de- 
scribe an object from the knowledge base, which in 
this case contains a variety of facts about hypothet- 
ical stereo components. In addition, the input pro- 
vides a communicative intention for the goal which 
may affect its ultimate realization, as shown in (10). 
For example, given the goal describe(x), the in- 
tention persuade-to-buy(hearer, x) may result in 
a radically different monologue than the intention 
persuade-to-sell(hearer, x). 

(10) Goal: describe e1 

     Input: generate(intention(bel(h1, 
                good-to-buy(e1)))) 

Information from the knowledge base is selected to 
be included in the output by a set of relations that 
determines the degree to which knowledge base facts 
and rules support the communicative intention of 
the speaker. For example, suppose the system "be- 
lieves" that conveying the proposition in (11) mod- 
erately supports the intention of making hearer h1 
want to buy e1, and further that the rule in (12) is 
known by h1. 

(11) bel(h1, holds(rating(X, powerful))) 

(12) holds(rating(X, powerful)) :- 
         holds(produce(X, Y)), 
         holds(isa(Y, watts-per-channel)), 
         holds(amount(Y, Z)), 
         number(Z), 
         Z >= 100. 

The program then consults the facts in the knowl- 
edge base, verifies that the property does indeed hold 
and consequently includes the corresponding facts in 
the set of properties to be conveyed to the hearer, 
as shown in (13). 

(13) holds(produce(e1, e7)). 
     holds(isa(e7, watts-per-channel)). 
     holds(amount(e7, 100)). 
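
The way rule (12) is checked against the facts in (13) can be mimicked in a few lines of Python. This is only a sketch of the lookup, not the Prolog program: the tuple encoding of facts and the function name rating_powerful are assumptions made for illustration, and the amplifier is written e1 here.

```python
# Facts from (13), encoded as (predicate, subject, value) tuples (our encoding).
FACTS = {
    ("produce", "e1", "e7"),
    ("isa", "e7", "watts-per-channel"),
    ("amount", "e7", 100),
}


def rating_powerful(x, facts=FACTS):
    """Return the facts supporting rating(x, powerful), or None if none exist.

    Mirrors rule (12): x produces some Y, Y is a watts-per-channel
    quantity, and its numeric amount Z is at least 100.
    """
    for pred, subj, y in facts:
        if pred != "produce" or subj != x:
            continue
        if ("isa", y, "watts-per-channel") not in facts:
            continue
        for pred2, subj2, z in facts:
            if (pred2 == "amount" and subj2 == y
                    and isinstance(z, (int, float)) and z >= 100):
                return [("produce", x, y),
                        ("isa", y, "watts-per-channel"),
                        ("amount", y, z)]
    return None
```

Calling rating_powerful("e1") returns the three supporting facts, which correspond to the properties selected for presentation, whereas an entity with no matching facts yields None.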

The content generator starts with a simple de- 
scription schema that specifies that an object is to 
be explicitly identified or defined before other propo- 
sitions concerning it are put forth. Other relevant 
propositions concerning the object in question are 
then linearly organized according to beliefs about 
how well they contribute to the overall intention. Fi- 
nally, a small set of rhetorical predicates rearranges 
the linear ordering of propositions so that sets of 
sentences that stand in some interesting rhetorical 
relationship to one another will be realized together 
in the output. These rhetorical predicates employ 
information structure to assist in maintaining the 
coherence of the output. For example, the conjunc- 
tion predicate specifies that propositions sharing the 
same theme or rheme be realized together in order 
to avoid excessive topic shifting. The contrast pred- 
icate specifies that pairs of themes or rhemes that 
explicitly contrast with one another be realized to- 
gether. The result is a set of properties roughly or- 
dered by the degree to which they support the given 
intention, as shown in (14). 

(14) holds(defn(isa(e1, amplifier))) 
     holds(design(e1, solid-state), pres) 
     holds(cost(e1, e9), pres) 
     holds(produce(e1, e7), pres) 
     holds(contrast(praise(e4, e1), 
                    revile(e5, e1)), past) 



The top-level propositions shown in (14) were se- 
lected by the program because the hearer (h1) is 
believed to be interested in the design of the am- 
plifier and the reviews the amplifier has received. 
Moreover, the belief that the hearer is interested in 
buying an expensive, powerful amplifier justifies in- 
cluding information about its cost and power rat- 
ing. Different sets of propositions would be gener- 
ated for other (perhaps thriftier) hearers. Addition- 
ally, note that the propositions praise(e4, e1) and 
revile(e5, e1) are combined into the larger propo- 
sition contrast(praise(e4, e1), revile(e5, e1)). 
This is accomplished by the rhetorical constraints 
that determine the two propositions to be con- 
trastive because e4 and e5 belong to the same set 
of alternative entities in the knowledge base and 
praise and revile belong to the same set of al- 
ternative propositions in the knowledge base. 
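
The test for contrastive propositions can be sketched as follows. The Python below is an illustration under our own assumptions: propositions are encoded as (functor, arg1, arg2) triples, the alternative sets are hand-written rather than inferred from a class hierarchy, and the function name contrastive is hypothetical. Following the criterion given in Section 4.2, two propositions count as contrastive if they contain two contrasting pairs of entities, or one contrasting pair together with contrasting functors.

```python
# Hand-written alternative sets, standing in for sets derived from the KB.
ALT_ENTITIES = {frozenset({"e4", "e5"}), frozenset({"e1", "e2"})}
ALT_FUNCTORS = {frozenset({"praise", "revile"})}


def alt_entity(a, b):
    """True if a and b are distinct members of the same alternative set."""
    return frozenset({a, b}) in ALT_ENTITIES


def alt_functor(f, g):
    """True if functors f and g are alternatives to one another."""
    return frozenset({f, g}) in ALT_FUNCTORS


def contrastive(p, q):
    """Decide whether two (functor, arg1, arg2) propositions contrast."""
    (f, a1, a2), (g, b1, b2) = p, q
    # Count argument positions filled by contrasting entities.
    pairs = sum(1 for a, b in ((a1, b1), (a2, b2)) if alt_entity(a, b))
    return pairs == 2 or (pairs >= 1 and alt_functor(f, g))
```

Under these assumptions, contrastive(("praise", "e4", "e1"), ("revile", "e5", "e1")) holds because e4 and e5 are alternatives and praise and revile are alternative functors, which is the configuration that licenses the combined contrast proposition in (14).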

The next phase of content generation recognizes 
the dependency relationships between the proper- 
ties to be conveyed based on shared discourse enti- 
ties. This phase, which represents an extension of 
the rhetorical constraints, arranges propositions to 
ensure that consecutive utterances share semantic 
material (cf. (McKeown et al., 1994)). This rule, 
which in effect imposes a strong bias for Centering 
Theory's continue and retain transitions (Grosz et 
al., 1986), determines the theme-rheme segmentation 
for each proposition. 

4.2 Sentence Planning 

After the coherence constraints from the previous 
section are applied, the sentence planner is respon- 
sible for making decisions concerning the form in 
which propositions are realized. This is accom- 
plished by the following simple set of rules. First, 
definitional isa properties are realized by the ma- 
trix verb. Other isa properties are realized by nouns 
or noun phrases. Top-level properties (such as those 
in (14)) are realized by the matrix verb. Finally, 
embedded properties (those evoked for building re- 
ferring expressions for discourse entities) are realized 
by adjectival modifiers if possible and otherwise by 
relative clauses. 

While there are certainly a number of linguis- 
tically interesting aspects to the sentence planner, 
the most important aspect for the present purposes 
is the determination of theme-foci and rheme-foci. 
The focus assignment algorithm employed by the 
sentence planner, which has access to both the dis- 
course model and the knowledge base, works as fol- 
lows. First, each property or discourse entity in the 
semantic and information structural representations 
is marked as either previously mentioned or new to 
the discourse. This assignment is made with re- 
spect to two data structures, the discourse entity 
list (DEList), which tracks the succession of entities 
through the discourse, and a similar structure for 
evoked properties. Certain aspects of the semantic 
form are considered unaccentable because they cor- 
respond to the interpretations of closed-class items 
such as function words. Items that are assigned fo- 
cus based on their "newness" are assigned the ◦ focus 
operator, as shown in (15). 

(15) Semantics:        defn(isa(◦e1, ◦c1)) 
     Theme:            ◦e1 
     Rheme:            λx.isa(x, ◦c1) 
     Supporting Props: isa(c1, ◦amplifier) 
                       ◦design(c1, ◦solidstate) 

The second step in the focus assignment algorithm 
checks for the presence of contrasting propositions 
in the ISStore, a structure that stores a history of 
information structure representations. Propositions 
are considered contrastive if they contain two con- 
trasting pairs of discourse entities, or if they contain 
one contrasting pair of discourse entities as well as 
contrasting functors. 

Discourse entities are determined to be contrastive 
if they belong to the same set of alternatives in the 
knowledge base, where such sets are inferred from 
the isa-links that define class hierarchies. While the 
present implementation only considers entities with 
the same parent or grandparent class to be alterna- 
tives for the purposes of contrastive stress, a gradu- 
ated approach that entails degrees of contrastiveness 
may also be possible. 

The effects of the focus assignment algorithm are 
easily shown by examining the generation of an ut- 
terance that contrasts with the utterance shown 
in (15). That is, suppose the generation program 
has finished generating the output corresponding to 
the examples in (10) through (15) and is assigned 
the new goal of describing entity e2, a different am- 
plifier. After applying the second step of the focus 
assignment algorithm, contrasting discourse entities 
are marked with the • contrastive focus operator, as 
shown in (16). Since e1 and e2 are both instances of 
the class amplifiers and c1 and c2 both describe 
the class amplifiers itself, these two pairs of dis- 
course entities are considered to stand in contrastive 
relationships. 



(16) Semantics:        defn(isa(•e2, •c2)) 
     Theme:            •e2 
     Rheme:            λx.isa(x, •c2) 
     Supporting Props: class(c2, amplifier) 
                       design(c2, ◦tube) 



While the previous step of the algorithm deter- 
mined which abstract discourse entities and proper- 
ties stand in contrast, the third step uses the con- 
trastive focus algorithm described in Section 2.2 to 
determine which elements need to be contrastively 
focused for reference to succeed. This algorithm de- 
termines the minimal set of properties of an entity 
that must be "focused" in order to distinguish it 
from other salient entities. For example, although 
the representation in (16) specifies that e2 stands 
in contrast to some other entity, it is the property 
of e2 having a tube design rather than a solid-state 
design that needs to be conveyed to the hearer. Af- 
ter applying the third step of the focus assignment 
algorithm to (16), the result appears as shown in 
(17), with "tube" contrastively focused as desired. 

(17) Semantics:        defn(isa(•e2, •c2)) 
     Theme:            •e2 
     Rheme:            λx.isa(x, •c2) 
     Supporting Props: isa(c2, amplifier) 
                       design(c2, •tube) 



The final step in the sentence planning phase of 
generation is to compute a representation that can 
serve as input to a surface form generator based on 
Combinatory Categorial Grammar (CCG) (Steed- 
man, 1991), as shown in (18).3 



(18) Theme: np(3,s) : 
         (e1^S)^def(e1, •x5(e1) & S) @th 

     Rheme: s : 
         (act^pres)^indef(c1, (amplifier(c1) & 
         •tube(c1)) & isa(e1, c1)) \ np(3,s) : e1 @rh 



Given the focus-marked output of the sentence 
planner, the surface generation module consults a 
CCG grammar which encodes the information struc- 
ture/intonation mapping and dictates the genera- 
tion of both the syntactic and prosodic constituents. 
The result is a string of words and the appropriate 
prosodic annotations, as shown in (19). The output 
of this module is easily translated into a form suit- 
able for a speech synthesizer, which produces spoken 
output with the desired intonation.4 

(19) The X5         is a TUBE amplifier. 
         L+H* L(H%)      H*        LL$ 

4.3 Results 

The modules described above and shown in Fig- 
ure 1 are implemented in Quintus Prolog. The sys- 
tem produces the types of output shown in (20) and 
(21), which should be interpreted as a single (two 
paragraph) monologue satisfying a goal to describe 
two different objects.5 Note that both paragraphs 
include very similar types of information, but radi- 
cally different intonational contours, due to the dis- 
course context. In fact, if the intonational patterns 
of the two examples are interchanged, the resulting 
speech sounds highly unnatural. 

(20) a. Describe the x4. 
     b. The X4 
            L+H* L(H%) 
        is a SOLID-state AMPLIFIER. 
             H*          H*        LL$ 
        It COSTS EIGHT HUNDRED DOLLARS, 
           H*    H*    H*      H*      LL% 
        and PRODUCES 
            H* 
        ONE hundred watts-per-CHANNEL. 
        H*                    H*      LL$ 
        It was PRAISED by STEREOFOOL, 
               H*         !H*        LH% 
        an AUDIO JOURNAL, 
           H*    H*      LH% 
        but was REVILED by AUDIOFAD, 
                H*         !H*      LH% 
        ANOTHER audio journal. 
        H*                    LL$ 

3 A complete description of the CCG generator can 
be found in (Prevost and Steedman, 1993). CCG was 
chosen as the grammatical formalism because it licenses 
non-traditional syntactic constituents that are congruent 
with the bracketings imposed by information structure 
and intonational phrasing, as illustrated in (3). 

4 The system currently uses the AT&T Bell Laborato- 
ries TTS system, but the implementation is easily adapt- 
able to other synthesizers. 

5 The implementation assigns slightly higher pitch to 
accents bearing the subscript c (e.g. H*c), which mark 
contrastive focus as determined by the algorithm de- 
scribed above and in (Prevost, 1995). 



(21) a. Describe the x5. 

     b. The X5         is a TUBE amplifier. 
            L+H* L(H%)      H*        LL$ 

        IT         costs NINE hundred dollars, 
        L+H* L(H%)       H*                 LH% 
        produces TWO hundred watts-per-channel, 
                 H*                         LH% 
        and was praised 
        by Stereofool AND Audiofad. 
                      H*         LL$ 



Several aspects of the output shown above are 
worth noting. Initially, the program assumes that 
the hearer has no specific knowledge of any partic- 
ular objects in the knowledge base. Note however, 
that every proposition put forth by the generator 
is assumed to be incorporated into the hearer's set 
of beliefs. Consequently, the descriptive phrase "an 
audio journal," which is new information in the first 
paragraph, is omitted from the second. Additionally, 
when presenting the proposition 'Audiofad is an au- 
dio journal,' the generator is able to recognize the 
similarity with the corresponding proposition about 
Stereofool (i.e. both propositions are abstractions 
over the single variable open proposition 'X is an au- 
dio journal'). The program therefore interjects the 
other property and produces "another audio jour- 
nal." 

Several aspects of the contrastive intonational ef- 
fects in these examples also deserve attention. Be- 
cause of the content generator's use of the rhetorical 
contrast predicate, items are eligible to receive stress 
in order to convey contrast before the contrast- 
ing items are even mentioned. This phenomenon 
is clearly illustrated by the clause "PRAISED by 
STEREOFOOL" in (20), which is contrastively 
stressed before "REVILED by AUDIOFAD" is ut- 
tered. Such situations are produced only when the 
contrasting propositions are gathered by the content 
planner in a single invocation of the generator and 
identified as contrastive when the rhetorical predi- 
cates are applied. Moreover, unlike systems that rely 
solely on word class and given/new distinctions for 
determining accentual patterns, the system is able 
to produce contrastive accents on pronouns despite 
their "given" status, as shown in (21). 



5 Conclusions 

The generation architecture described above and im- 
plemented in Quintus Prolog produces paragraph- 
length, spoken monologues concerning objects in a 
simple knowledge base. The architecture relies on 
a mapping between a two-tiered information struc- 
ture representation and intonational tunes to pro- 
duce speech that makes appropriate contrastive dis- 
tinctions prosodically. The process of natural lan- 
guage generation, in accordance with much of the re- 
cent literature in the field, is divided into three pro- 
cesses: high-level content planning, sentence plan- 
ning, and surface generation. Two points concern- 
ing the role of intonation in the generation process 
are emphasized. First, since intonational phrasing is 
dependent on the division of utterances into theme 
and rheme, and since this division relates consecu- 
tive sentences to one another, matters of information 
structure (and hence intonational phrasing) must 
be largely resolved during the high-level planning 
phase. Second, since accentual decisions are made 
with respect to the particular linguistic realizations 
of discourse properties and entities (e.g. the choice 
of referring expressions), these matters cannot be 
fully resolved until the sentence planning phase. 